llm fine-tuning
MobileFineTuner: A Unified End-to-End Framework for Fine-Tuning LLMs on Mobile Phones
Geng, Jiaxiang, Zhao, Lunyu, Lu, Yiyi, Luo, Bing
Mobile phones are the most ubiquitous end devices, generating vast amounts of human-authored data and serving as the primary platform for end-side applications. As high-quality public data for large language models (LLMs) approaches exhaustion, on-device fine-tuning provides an opportunity to leverage private user data while preserving privacy. However, existing approaches are predominantly simulation-based or rely on IoT devices and PCs, leaving commodity mobile phones largely unexplored. A key gap is the absence of an open-source framework that enables practical LLM fine-tuning on mobile phones. We present MobileFineTuner, a unified open-source framework that enables end-to-end LLM fine-tuning directly on commodity mobile phones. MobileFineTuner is designed for efficiency, scalability, and usability, supporting full-parameter fine-tuning (Full-FT) and parameter-efficient fine-tuning (PEFT). To address the memory and energy limitations inherent to mobile phones, we introduce system-level optimizations including parameter sharding, gradient accumulation, and energy-aware computation scheduling. We demonstrate the practicality of MobileFineTuner by fine-tuning GPT-2, Gemma 3, and Qwen 2.5 on real mobile phones. Extensive experiments and ablation studies validate the effectiveness of the proposed optimizations and establish MobileFineTuner as a viable foundation for future research on on-device LLM training.
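Gradient accumulation, one of the optimizations named in the abstract above, can be illustrated with a minimal sketch. This is not MobileFineTuner's implementation; the toy one-parameter regression, the data, and the function names `grad_mse` and `accumulated_grad` are assumptions made purely for illustration. The sketch shows that averaging gradients over equal-sized micro-batches, processed one at a time, reproduces the full-batch gradient while only one micro-batch needs to be held in memory.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for y ~ w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Accumulate gradients micro-batch by micro-batch (equal sizes assumed)."""
    chunks = [data[i:i + micro_batch_size]
              for i in range(0, len(data), micro_batch_size)]
    total = 0.0
    for chunk in chunks:
        # Each micro-batch contributes 1/len(chunks) of the final gradient.
        total += grad_mse(w, chunk) / len(chunks)
    return total


# Hypothetical toy data, roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch_size=2)
print(abs(full - accum) < 1e-9)
```

The equivalence holds exactly only when the micro-batches have equal size and the loss is a mean over examples; otherwise each micro-batch gradient would need to be weighted by its share of the examples.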
Curvature-Aware Safety Restoration In LLMs Fine-Tuning
Bach, Thong, Nguyen-Tang, Thanh, Nguyen, Dung, Le, Thao Minh, Tran, Truyen
Large Language Models (LLMs) encode safety-aligned behaviors during pretraining, but these safeguards deteriorate during task-specific fine-tuning, a phenomenon we identify as safety alignment drift. Studies demonstrate that even minimal fine-tuning can compromise safety mechanisms, with models like GPT-3.5 Turbo becoming consistently unsafe after adaptation on just 10 adversarial examples [Qi et al., 2023a]. Attempts to address this issue by modifying model behavior generally fall into two main categories, both of which suffer from inherent limitations. Behavioral unlearning methods attempt to remove undesirable knowledge or responses [Cao and Yang, 2015, Bourtoule et al., 2021a], but often require costly retraining or risk catastrophic forgetting. Model editing approaches aim to update factual associations or local behaviors through direct parameter intervention [Meng et al., 2022, Mitchell et al., 2022], yet struggle to generalize beyond narrow scopes or isolated prompts. To solve these issues, we propose a new direction that treats safety behavior as an intrinsic property of the model's geometry and seeks to restore alignment through curvature-aware navigation of the loss landscape. Our key insight, supported by extensive empirical analysis (Section 2), is that models preserve notable structural properties in their loss landscapes with respect to harmful content after fine-tuning. Specifically, we observe high correlations in models' responses to harmful inputs before and after fine-tuning, despite substantial divergence in other functional behaviors. This suggests that safety mechanisms remain largely preserved in the parameter space, merely shifted to less dominant regions during task-specific optimization.
Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT
Lee, Hee-Jin, Guo, Zhen, Jin, Luchao, Goudarzi, Morteza Moazami
We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.
VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation
Calzada, Paul E., Ibnat, Zahin, Rahman, Tanvir, Kandula, Kamal, Lu, Danyu, Saha, Sujan Kumar, Farahmandi, Farimah, Tehranipoor, Mark
Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. In this work, we examine the current literature on RTL generation using LLMs and identify key requirements for training and fine-tuning datasets. We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management with PostgreSQL, data collection from code hosting sites like OpenCores and GitHub, and data preprocessing to verify the codes' syntax, run logic synthesis, and extract relevant module metadata. We implement a scalable and efficient DB infrastructure to support analysis and detail our preprocessing pipeline to enforce high-quality data before DB insertion. The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest high-quality Verilog dataset for LLM fine-tuning to our knowledge. We further evaluate the dataset, address associated challenges, and explore potential applications for future research and development in LLM-based hardware generation.
A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets
Lagasse, Ryan, Kierans, Aidan, Ghosh, Avijit, Dori-Hacohen, Shiri
We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of examples and their average token length--what we term dataset volume--play a decisive role in model performance. Experiments on the BRICC dataset (Salavati et al., 2024) and subsets of the MMLU dataset (Hendrycks et al., 2021), evaluated under multiple subsampling strategies, reveal that data composition significantly affects token efficiency. These results motivate refined scaling laws for practical LLM fine-tuning in resource-constrained settings. Code will be made available upon publication.
MoRE: A Mixture of Low-Rank Experts for Adaptive Multi-Task Learning
Zhang, Dacao, Zhang, Kun, Chu, Shimao, Wu, Le, Li, Xin, Wei, Si
With the rapid development of Large Language Models (LLMs), Parameter-Efficient Fine-Tuning (PEFT) methods, which aim to fine-tune LLMs efficiently by updating fewer parameters, have gained significant attention. As a representative PEFT method, Low-Rank Adaptation (LoRA) introduces low-rank matrices to approximate the incremental tuning parameters and achieves impressive performance over multiple scenarios. Many variants have since been proposed to improve it further. However, these methods either focus on single-task scenarios or separately train multiple LoRA modules for multi-task scenarios, limiting the efficiency and effectiveness of LoRA in multi-task scenarios. To better adapt to multi-task fine-tuning, in this paper, we propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task PEFT. Specifically, instead of using an individual LoRA for each task, we align different ranks of a LoRA module with different tasks, which we name low-rank experts. Moreover, we design a novel adaptive rank selector to select the appropriate expert for each task. By jointly training low-rank experts, MoRE can enhance the adaptability and efficiency of LoRA in multi-task scenarios. Finally, we conduct extensive experiments over multiple multi-task benchmarks along with different LLMs to verify model performance. Experimental results demonstrate that compared to traditional LoRA and its variants, MoRE significantly improves the performance of LLMs in multi-task scenarios and incurs no additional inference cost. We also release the model and code to facilitate the community.
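The low-rank update that LoRA introduces can be sketched as follows. This is a minimal illustration of plain LoRA only, not of MoRE's rank-aligned experts or its adaptive rank selector; the helper names `matvec`, `lora_forward`, and `lora_param_count`, the `scale` argument, and the tiny matrices are assumptions for illustration. The frozen weight W is left untouched and a trainable product B A of rank r is added to it, so the layer computes W x + scale * B (A x).

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, scale=1.0):
    """Frozen base output plus the scaled low-rank correction B(Ax)."""
    base = matvec(W, x)              # W is d_out x d_in and stays frozen
    delta = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters in A and B, versus d_out * d_in for full tuning."""
    return r * (d_in + d_out)


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 1.0]]          # rank r = 1
B_zero = [[0.0], [0.0]]   # B starts at zero, so the adapter is initially a no-op
B_trained = [[1.0], [2.0]]
x = [1.0, 1.0]

print(lora_forward(W, A, B_zero, x))
print(lora_forward(W, A, B_trained, x))
print(lora_param_count(4096, 4096, 8), 4096 * 4096)
```

For a 4096 x 4096 layer at rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is where the parameter efficiency comes from.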
KerZOO: Kernel Function Informed Zeroth-Order Optimization for Accurate and Accelerated LLM Fine-Tuning
Mi, Zhendong, Tan, Qitao, Yu, Xiaodong, Zhu, Zining, Yuan, Geng, Huang, Shaoyi
Large language models (LLMs) have demonstrated impressive capabilities across numerous NLP tasks. Nevertheless, conventional first-order fine-tuning techniques impose heavy memory demands, creating practical obstacles to real-world applications. Zeroth-order (ZO) optimization has recently emerged as a promising memory-efficient alternative, as it circumvents the need for backpropagation by estimating gradients solely through forward passes--making it particularly suitable for resource-limited environments. Despite its efficiency, ZO optimization suffers from gradient estimation bias, which significantly hinders convergence speed. To address this, we analytically identify and characterize the lower-order bias introduced during ZO-based gradient estimation in LLM fine-tuning. Motivated by tools in mathematical physics, we introduce a kernel-function-based ZO framework aimed at mitigating this bias and improving optimization stability. KerZOO achieves comparable or superior performance to existing ZO baselines in both full-parameter and parameter-efficient fine-tuning settings of LLMs, while significantly reducing the number of iterations required to reach convergence. For example, KerZOO reduces total GPU training hours by as much as 74% and 44% on the WSC and MultiRC datasets when fine-tuning the OPT-2.7B model, and can exceed the MeZO baseline by 2.9% and 2.6% in accuracy. We show that the kernel function is an effective avenue for reducing estimation bias in ZO methods.
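The forward-pass-only gradient estimate that ZO methods rely on can be sketched with the standard two-point (central-difference) estimator. This is the generic estimator used by ZO baselines, not KerZOO's kernel-function correction; the function name `zo_grad`, the quadratic toy loss, and the step size `eps` are assumptions for illustration. The loss is evaluated at theta + eps * z and theta - eps * z for a perturbation direction z, and the difference is projected back along z.

```python
import random


def zo_grad(f, theta, z, eps=1e-3):
    """Two-point zeroth-order estimate: ((f(t + eps z) - f(t - eps z)) / (2 eps)) * z."""
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    projected = (f(plus) - f(minus)) / (2 * eps)  # directional derivative along z
    return [projected * zi for zi in z]


def loss(theta):
    """Toy quadratic loss with true gradient 2 * theta."""
    return sum(t * t for t in theta)


theta = [1.0, -2.0]

# With a fixed direction the estimate recovers the gradient component along it.
g_fixed = zo_grad(loss, theta, z=[1.0, 0.0])

# In practice z is sampled at random, e.g. from a standard Gaussian.
rng = random.Random(0)
z_random = [rng.gauss(0.0, 1.0) for _ in theta]
g_random = zo_grad(loss, theta, z_random)
print(g_fixed, g_random)
```

Only two forward evaluations and the direction z are needed per step, so no activations have to be stored for backpropagation; the price is a noisy estimate whose error depends on the direction sampled and on eps.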
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
Chen, Pin-Yu, Shen, Han, Das, Payel, Chen, Tianyi
Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon also known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.
MobiLLM: Enabling LLM Fine-Tuning on the Mobile Device via Server Assisted Side Tuning
Li, Liang, Yang, Xingke, Wu, Wen, Wang, Hao, Ohtsuki, Tomoaki, Fu, Xin, Pan, Miao, Shen, Xuemin
Large language models (LLMs) on mobile devices and their potential applications never fail to fascinate. However, on-device LLM fine-tuning poses great challenges due to extremely high memory requirements and slow training speeds. Even with parameter-efficient fine-tuning (PEFT) methods that update only a small subset of parameters, resource-constrained mobile devices cannot afford them. In this paper, we propose MobiLLM to enable memory-efficient transformer LLM fine-tuning on a mobile device via server-assisted side-tuning. In particular, MobiLLM allows the resource-constrained mobile device to retain merely a frozen backbone model, while offloading the memory- and computation-intensive backpropagation of a trainable side-network to a high-performance server. Unlike existing fine-tuning methods that keep trainable parameters inside the frozen backbone, MobiLLM separates a set of parallel adapters from the backbone to create a backpropagation bypass, involving only one-way activation transfers from the mobile device to the server with low-width quantization during forward propagation. In this way, the data never leaves the mobile device, while the device avoids backpropagation through the local backbone model and its forward propagation can be parallelized with the server-side execution. Thus, MobiLLM preserves data privacy while significantly reducing the memory and computational burdens for LLM fine-tuning. Through extensive experiments, we demonstrate that MobiLLM can enable a resource-constrained mobile device, even a CPU-only one, to fine-tune LLMs and significantly reduce convergence time and memory usage.
Leveraging Conditional Mutual Information to Improve Large Language Model Fine-Tuning For Classification
Sivakaran, Thanushon, Yang, En-Hui
Although large language models (LLMs) have demonstrated remarkable capabilities in recent years, the potential of information theory (IT) to enhance LLM development remains underexplored. This paper introduces the information theoretic principle of Conditional Mutual Information (CMI) to LLM fine-tuning for classification tasks, exploring its promise in two main ways: minimizing CMI to improve a model's standalone performance and maximizing CMI to enhance knowledge distillation (KD) for more capable student models. To apply CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, which was initially developed for image classification, with some modification. By minimizing CMI during LLM fine-tuning, we achieve superior performance gains on 6 of 8 GLUE classification tasks compared to BERT. Additionally, maximizing CMI during the KD process results in significant performance improvements in 6 of 8 GLUE classification tasks compared to DistilBERT. These findings demonstrate CMI's adaptability for optimizing both standalone LLMs and student models, showcasing its potential as a robust framework for advancing LLM fine-tuning. Our work bridges the gap between information theory and LLM development, offering new insights for building high-performing language models.